Skip to content

Fix cartpole-camera frame-stacking performance regression#5849

Merged
pbarejko merged 14 commits into
isaac-sim:developfrom
jmart-nv:jmart/framestack-perf
Jun 5, 2026
Merged

Fix cartpole-camera frame-stacking performance regression#5849
pbarejko merged 14 commits into
isaac-sim:developfrom
jmart-nv:jmart/framestack-perf

Conversation

@jmart-nv

@jmart-nv jmart-nv commented May 28, 2026

Copy link
Copy Markdown

Description

Frame-stacking was added to cartpole-camera by #5574 to make the task solvable on the Newton+Warp backend. Single-frame RGB observations don't carry pole velocity, so a single-image policy can only recover ~100/300 mean episode reward. On PhysX+RTX, implicit damping in the dynamics plus temporal antialiasing in the renderer leak enough adjacent-frame correlation that a single-frame policy still hits ~296. Newton+Warp has neither, so explicit frame stacking is required for the task to be solvable on this backend.

The implementation regressed throughput by 42% (54.6 → 31.7 FPS) and inflated peak VRAM 2.7× (3.13 → 8.37 GB) at R256, 1024 envs. PR #5574 added a CircularBuffer ring + a stacked_image MDP term that:

  1. Stored frames as float32 even when the camera output was uint8 → 4× ring memory and 4× shift bandwidth.
  2. Ran the per-frame normalize (x/255 - mean(x)) before the buffer, forcing the buffer to be float32.
  3. Re-rolled the buffer with torch.roll on every append → ring-sized temporary per env.step.
  4. Built the channel-stacked policy obs via permute + reshape + clone → fresh (B, H, W, K*C) float32 tensor per env.step.
  5. Inlined the normalize math at 5 call sites across DirectRLEnv and ManagerBasedEnv paths.

This PR rewrites the frame-stacking hot path; the result runs 24% faster than the pre-framestack baseline with identical PPO convergence.

Change Why
uint8 ring buffer. Defer the x/255 − mean normalize past the frame-stack buffer for RGB-like data types. Ring is 4× smaller; shift bandwidth per env.step drops 4×.
Free contiguous stacked output. New CircularBuffer.stack_dim arg + .stacked property arrange storage so the channel-stacked output is a contiguous view of internal storage. Eliminates one (B, H, W, K*C)-sized allocation and a permute kernel per env.step.
Drop torch.roll on append. Replace with a front-to-back memmove. One fewer ring-sized allocation and one fewer kernel launch per env.step.
Fused Warp normalize kernel (normalize_image_uint8). One kernel: (uint8 / 255) − per-(batch, channel) mean → float32. Replaces upcast + reduce + subtract (three PyTorch passes) with one fused launch over the K-stacked input.
Tiled int32 partial-sum reduction kernel (spatial_sum_uint8_tiled). Warp kernel writes (B, NUM_TILES, C) int32 partials along H; PyTorch collapses the partial dim. int32 accumulator is safe up to H·W·255 (128× headroom at R256) and ~3× faster than int64 / float32 on PyTorch's auto-tuned reduce path. C-innermost launch shape gives stride-1 reads on src's contiguous trailing dim.
Shared isaaclab.utils.images.normalize_camera_image dispatch. One helper used by both image() (DirectRLEnv) and stacked_image.__call__ (ManagerBasedEnv); routes uint8 contiguous inputs through the Warp fast path and falls back to PyTorch otherwise. Removes inline normalize math duplicated across 5 sites; the fast path picks itself up automatically.
image(clone=False) opt-out. Callers that immediately copy into their own storage skip the defensive clone. Saves one obs-sized allocation per env.step at the consumer site (frame-stack buffer).
Allocator-managed obs lifetime. _get_observations allocates the normalize output fresh each call (no persistent out= buffer). The RL trainer holds a reference to the previous-iteration observation until record_transition returns; reusing one buffer across env.step boundaries aliases the trainer's prior obs. Fresh allocation lets PyTorch's caching allocator hand out a different block while the prior is still referenced.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Results

R256 (256×256), 1024 envs, PPO via skrl 2.1.0, seed 42, on NVIDIA L40. Perf via non_rl measured over n=3 warm trials:

Stage FPS step_ms Peak VRAM Mean reward
Pre-framestack baseline 54.62 18.67 3.13 GB ~102 (unsolvable — no temporal info)
Frame stacking from #5574 31.72 32.07 8.37 GB 297.68
This PR (pre-rebase) 67.87 15.32 4.41 GB 297.60
This PR (post-rebase) 69.50 14.94 4.57 GB 293.43

Screenshots

pr_figure_v2

Environment

NVIDIA L40 (48 GB), driver 570.158.01. Newton+Warp backend (presets=newton_mjwarp,newton_renderer). skrl 2.1.0 trainer.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation (N/A - frame stacking already documented, these changes are implementation details)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label May 28, 2026

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR addresses a significant 42% throughput regression introduced by frame-stacking in #5574 for the cartpole-camera task. The fix implements several well-designed optimizations: (1) storing uint8 frames in the ring buffer instead of float32 (4× smaller), (2) deferring normalization past the frame-stack buffer, (3) replacing torch.roll with memmove-style shifting, (4) adding a fused Warp kernel for RGB normalize, and (5) providing a free contiguous stacked view via the new stack_dim parameter. The result is a 24% improvement over the pre-framestack baseline (67.87 FPS vs 54.62 FPS) with identical PPO convergence.

Incremental Update (a86c6a7)

The author addressed review feedback in commit a86c6a7:

Docstring improvementsstack_dim parameter now includes a concrete example (stack_dim=-1 on (B, H, W, C)(B, H, W, K*C)) making the API much more intuitive.

In-place subtract in normalize fallbackimages.py now uses images -= torch.mean(...) instead of allocating a new tensor, addressing the memory optimization suggestion.

Cache growth documented — The _uint8_sum_partials_cache comment now states it "Typically holds one entry per training run (camera resolution and device are fixed)," making the bounded growth clear.

Tile size rationale_UINT8_SUM_TILE_HW comment now documents it was "Tuned on L40; robust across modern NVIDIA arches (Ampere/Ada/Hopper) at R256," addressing the hardcoded value concern.

Shift loop documented — Added "Cheap at the typical frame-stack K=2-4" comment clarifying expected usage.

Minor fix — Corrected TiledCameraCfgCameraCfg in is_rgb_like docstring.

Post-rebase Update (0b0d0ba)

Branch was rebased onto latest main. The rebase commit contains only mechanical cleanup — precommit formatting, variable renames aligned with upstream refactors (e.g. _default_env_origins_default_env_pose), and changelog corrections. No logic changes to the frame-stacking or normalization code.

Test Coverage

Excellent coverage — comprehensive tests for the circular buffer stack_dim mode, normalize dispatch paths, Warp kernel correctness, and a consecutive-call aliasing regression test.

Verdict

Approve — All previous review concerns were addressed. The rebase commit is mechanical cleanup only. Ready to merge.


Incremental Update (5b3e944)

Release-prep commit rebasing onto the v6.3.0 release train. Changes relevant to this PR:

Version bumps — isaaclab 6.2.1→6.3.0, isaaclab_rl 0.5.3→0.6.0, isaaclab_tasks 2.0.1→2.0.2, isaaclab_newton 0.14.1→0.15.0, isaaclab_contrib 0.4.1→0.4.2, isaaclab_experimental 0.1.0→0.1.1. Changelogs rolled from changelog.d/ fragments into CHANGELOG.rst.

Changelog fragments consumed — The per-PR .rst/.skip fragments from this PR and others merged into the release are now removed; their content lives in the main changelogs.

ls_parallel deprecation landedMJWarpSolverCfg.ls_parallel is now formally deprecated (emits DeprecationWarning, force-set to False). All task configs, tests, and docs drop the field. This is a related-but-independent cleanup that happened to land in the same batch.

wrap_warp_to_torch.py removed — The obsolete migration script is deleted; a changelog entry documents the removal.

CI pinningovphysx==0.4.13 pinned in CI workflows and pyproject.

Lazy-export fixisaaclab.envs.__init__ and isaaclab.sensors.ray_caster.__init__ switch from eager submodule imports to pure lazy_export(), matching the warp-safe pattern from the experimental package fix.

Docs polish — MJWarp solver docs updated (removed ls_parallel references, clarified ls_iterations semantics), docs/_extensions/isaaclab_docs.py picks up smv_current_version.

Spot velocity task — Newton config expanded with collision pipeline and shape margin settings (unrelated to this PR's camera perf work).

Benchmark scriptbenchmark_startup.py now wraps the first env.step in torch.inference_mode().

Camera demo subtitle fix — Corrected mismatched subtitles lists in scripts/demos/sensors/cameras.py.

License-check workflow modernized — Switched from pip to uv (astral-sh/setup-uv).

No changes to the core frame-stacking / normalization logic in CircularBuffer, stacked_image, normalize_camera_image, or the Warp kernel code. All test files updated only to pass normalize=False/clone=True kwargs matching the new API signatures — test logic is unchanged.

Verdict

Approve — Release packaging only; the performance fix logic is untouched.

Comment thread source/isaaclab/isaaclab/utils/images.py Outdated
Comment thread source/isaaclab/isaaclab/utils/buffers/circular_buffer.py
Comment thread source/isaaclab/isaaclab/utils/buffers/circular_buffer.py
jmart-nv added a commit to jmart-nv/IsaacLab that referenced this pull request Jun 1, 2026
…y-up)

- images.py: TiledCameraCfg -> CameraCfg in is_rgb_like docstring (deprecated class)
- images.py: PyTorch normalize fallback uses in-place subtract (one fewer alloc on the cold path)
- circular_buffer.py: clarified stack_dim docstring with a (B,H,W,C) example; noted the typical K=2-4 range on the append shift loop
- ops.py: noted the realistic 1-entry bound on _uint8_sum_partials_cache; noted L40 tuning provenance of _UINT8_SUM_TILE_HW=32
@jmart-nv jmart-nv marked this pull request as ready for review June 3, 2026 14:05
@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR rewrites the frame-stacking hot path in cartpole-camera to fix a 42% throughput regression and 2.7× VRAM inflation introduced by #5574, achieving 24% faster throughput than the pre-framestack baseline.

  • CircularBuffer gets a stack_dim mode that arranges internal storage so .stacked returns a free contiguous reshape of K frames merged along the configured axis, eliminating a permute + reshape + clone allocation per step; torch.roll is replaced with a front-to-back narrow().copy_() loop and a CPU-side _need_reset flag skips the torch.any GPU sync on the steady-state path.
  • New Warp kernels (spatial_sum_uint8_tiled + normalize_image_uint8) fuse the uint8 / 255 - per-channel mean -> float32 normalize into a single launch; the PyTorch-level partial-sum collapse uses dtype=torch.int64 to prevent int32 overflow at large resolutions.
  • normalize_camera_image centralises dispatch and replaces five inline normalize sites; the defer_normalize pattern holds the ring buffer in uint8 until after stacking, shrinking ring memory and per-step copy bandwidth 4x.

Confidence Score: 5/5

Safe to merge. The hot-path rewrite is well-scoped, all five affected code paths have matching unit tests, and the aliasing hazard is explicitly documented and correctly avoided by allocating a fresh float32 tensor per step.

Every key invariant is backed by a targeted test: the uint8 ring stores raw frames, consecutive normalize calls return independent storage, the delay buffer clone removal preserves independence, and the is_first_push warmup fills all K slots. The int32 partial-sum overflow concern raised on an earlier version is addressed with dtype=torch.int64 in the final collapse. No correctness issues were found.

No files require special attention.

Important Files Changed

Filename Overview
source/isaaclab/isaaclab/utils/buffers/circular_buffer.py Adds stack_dim mode with contiguous stacked view, replaces torch.roll with a forward memmove loop, and introduces a CPU-side _need_reset flag to gate the is_first_push GPU sync; logic is correct and well-tested.
source/isaaclab/isaaclab/utils/warp/ops.py Adds _uint8_spatial_mean (cached partials scratch) and normalize_image_uint8 wrappers; partials.sum(dim=1, dtype=torch.int64) prevents int32 overflow at large resolutions, cache key includes channel_dim for BCHW/BHWC disambiguation.
source/isaaclab/isaaclab/utils/warp/kernels.py Adds normalize_image_uint8 and spatial_sum_uint8_tiled Warp kernels; per-tile int32 accumulation is safe at any common camera resolution and the final collapse uses int64.
source/isaaclab/isaaclab/utils/images.py New normalize_camera_image dispatch function; correctly routes uint8 contiguous 4D inputs to the Warp fast path, non-uint8/strided to the PyTorch fallback, depth/normals/passthrough handled consistently.
source/isaaclab/isaaclab/envs/mdp/observations.py stacked_image uses stack_dim=-1 for a free contiguous reshape, defers normalize past the uint8 ring for RGB-like data, eliminates the permute+reshape+clone allocation.
source/isaaclab_tasks/isaaclab_tasks/core/cartpole/cartpole_direct_camera_env.py Switches to stack_dim=1 (BCHW) and defers normalize past the uint8 ring; replaces stacked.reshape+clone with .stacked + normalize_camera_image.
source/isaaclab/isaaclab/utils/buffers/delay_buffer.py Removes the redundant .clone() after getitem; safe because advanced indexing already returns a copy independent of ring buffer storage.

Reviews (2): Last reviewed commit: "Address Greptile review: int64 partials ..." | Re-trigger Greptile

Comment thread source/isaaclab/isaaclab/utils/buffers/circular_buffer.py Outdated
Comment thread source/isaaclab_tasks/isaaclab_tasks/direct/cartpole/cartpole_camera_env.py Outdated
@jmart-nv jmart-nv marked this pull request as draft June 3, 2026 16:58
@jmart-nv

jmart-nv commented Jun 3, 2026

Copy link
Copy Markdown
Author

Moving this back to draft while I rebase onto the cartpole refactor

@jmart-nv jmart-nv force-pushed the jmart/framestack-perf branch from a86c6a7 to 0b0d0ba Compare June 3, 2026 21:30
jmart-nv added a commit to jmart-nv/IsaacLab that referenced this pull request Jun 3, 2026
…y-up)

- images.py: TiledCameraCfg -> CameraCfg in is_rgb_like docstring (deprecated class)
- images.py: PyTorch normalize fallback uses in-place subtract (one fewer alloc on the cold path)
- circular_buffer.py: clarified stack_dim docstring with a (B,H,W,C) example; noted the typical K=2-4 range on the append shift loop
- ops.py: noted the realistic 1-entry bound on _uint8_sum_partials_cache; noted L40 tuning provenance of _UINT8_SUM_TILE_HW=32
jmart-nv added 13 commits June 4, 2026 16:03
…isaac-sim#5574.

- circular_buffer.py: replace torch.roll with O(1) write-index append; add CPU mirror of max_len and CPU bool gate to skip GPU sync on the steady-state path; rotate on read via the `buffer` property.
- observations.py (stacked_image): drop trailing `.clone()` from the permute().reshape() chain (reshape on non-contig already allocates).
- cartpole_camera_presets_env.py: same `.clone()` drop in the direct env.
- delay_buffer.py: same `.clone()` drop on the DelayBuffer.compute return.
Skip per-frame /255 + mean-subtract on the stacking path; apply once on the stacked output. Ring buffer now holds native uint8 (4x cheaper per-step copies). Math is identical because K frames live in disjoint channel slices. Affects stacked_image MDP term and the cartpole DirectRLEnv parent (new `normalize: bool = True` kwarg) + subclass.
Replace the three-pass PyTorch normalize on the camera-observation hot path with a single fused Warp kernel + small mean reduction. Consolidate the per-data-type dispatch into a shared helper used by image(), stacked_image, CartpoleCameraEnv, and CartpoleCameraPresetsEnv.

Frame-stacking subclasses pre-allocate the float32 output once and reuse it across steps, eliminating the per-step transient that the previous deferred-normalize path incurred.
The Warp kernel produces int32 partial sums along H; PyTorch collapses the partials and divides once. Channels-innermost launch shape gives coalesced reads on src's contiguous trailing dim. Partials scratch is cached per (shape, device).
- stacked_image / CartpoleCameraPresetsEnv: don't reuse a pre-allocated normalize-output buffer across env.step; the previous-iteration obs (held by the RL trainer) was overwritten before record_transition could read it, causing PPO to record degenerate transitions
- normalize_image_uint8: docstring warning on the out= aliasing hazard
- CircularBuffer.reset: setitem (= 0.0) instead of advanced-getitem + .zero_() so partial-batch resets actually zero the storage
- CircularBuffer.stacked: drop unreachable empty-buffer guard
- _uint8_sum_partials_cache: store the tensor directly, not a tuple
- Tests: regressions for both bugs, image() clone kwarg, and DelayBuffer.compute() no-aliasing contract
…y-up)

- images.py: TiledCameraCfg -> CameraCfg in is_rgb_like docstring (deprecated class)
- images.py: PyTorch normalize fallback uses in-place subtract (one fewer alloc on the cold path)
- circular_buffer.py: clarified stack_dim docstring with a (B,H,W,C) example; noted the typical K=2-4 range on the append shift loop
- ops.py: noted the realistic 1-entry bound on _uint8_sum_partials_cache; noted L40 tuning provenance of _UINT8_SUM_TILE_HW=32
@jmart-nv jmart-nv force-pushed the jmart/framestack-perf branch from 0b0d0ba to 5b3e944 Compare June 4, 2026 21:08
@jmart-nv

jmart-nv commented Jun 4, 2026

Copy link
Copy Markdown
Author

After rebasing and refactoring to support the changes in #5930 which moved the channel dim to BCHW instead of BHWC.

  • FPS shows further improvement at 69 fps vs 54 fps baseline
  • VRAM shows marginal regression, up to 4.57 GB post-rebase vs 4.41GB pre-rebase (due to extra permute() each step)
  • Still converges around ~295 reward

Also addressed @greptile's concerns and profiled the fixes.

Updated description with latest numbers and plot. Moving this back to PR.

@jmart-nv jmart-nv marked this pull request as ready for review June 4, 2026 21:10
@greptile-apps

greptile-apps Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

The circular_buffer.py fix was cleanly applied — the .stacked property now correctly distinguishes the two None cases exactly as suggested. Thanks for that.

Regarding the _get_observations(clone, normalize) concern, since both comments went stale on the rebase, could you share whether the optimization params were factored out of the public signature, or did the rebase restructure things in a way that made the concern moot (e.g., the BCHW refactor changed how _get_observations is overridden)?

The perf numbers look excellent — 69.5 FPS vs 54.6 baseline is a meaningful win, and the VRAM regression from the extra permute() post-rebase is a reasonable trade for BCHW compatibility.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.

The shape of the appended data is expected to be (batch_size, ...), where the first dimension is the
batch dimension. Correspondingly, the shape of the ring buffer is (max_len, batch_size, ...).

When ``stack_dim`` is set, the internal layout is rearranged so that :attr:`stacked`

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (4df4101): New commit is a minor docstring fix (:paramref: → inline code formatting in circular_buffer.py). No new issues introduced.

✅ Previous P1 concern (misleading RuntimeError) was already addressed.
ℹ️ Previous P2 suggestion (factoring _get_observations) remains open as a non-blocking style note.

LGTM — no further action needed from my side.

@pbarejko pbarejko merged commit 0420e17 into isaac-sim:develop Jun 5, 2026
36 of 37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants